First, we load a few R packages

Motivation

Let’s say you have a set of \(n\) observations \(X_i = (X_{i1},...,X_{ip})\) for \(i = 1, \ldots, n\), each with \(p\) features. These data may or may not be “high-dimensional” (i.e. \(p \gg n\)). Depending on the goal of your analysis, dimensionality reduction methods can be useful in some of the following scenarios (not mutually exclusive):

  1. Exploratory data analysis

As we have seen in class, visualizing data is one of the most important parts of data science. The right visualization method may reveal problems with the data that can render the results of a standard analysis, although typically appropriate, completely useless. It can also help us make important discoveries.

We have shown methods for visualizing univariate and paired data, but visualizing relationships between columns or between rows of high-dimensional data is complicated. For example, if each observation has \(p = 100\) features, comparing each pair of features would require \({p \choose 2} = 4950\) scatter plots, and strong correlations between the features would make many of them redundant. Creating one single scatter plot of the data is not possible due to the high dimensionality.

Methods for dimension reduction aim to preserve important characteristics of the data, such as the distances between features or observations, using fewer dimensions, which makes data visualization (and exploratory data analysis in general) feasible.

We will consider different methods for dimension reduction, but the main approach we will discuss is called Principal Components Analysis (PCA). The goal is to find a new set of multivariate variables that are uncorrelated and explain as much of the variance as possible.

For our purposes, EDA is the primary reason we will be using dimensionality reduction methods for this lecture.

  2. Data Compression

Here we are interested in finding a matrix of lower rank (i.e. built from fewer variables) that best approximates the original data.

In the figure below, we see that using the first principal component alone only really captures the difference between light and dark. As you increase the number of principal components, more detail appears in the image.

If you go far enough, you will recover an image that looks very similar to the original, but that is of a smaller dimension than the original dataset.

image source

  3. Improve classification (hopefully)

We will not get to this until term 2, so stay tuned!

Motivating example of twin heights

Here we use a data set of twin heights. We simulate \(n\)=100 two-dimensional points (\(p\)=2 features) that represent the heights of pairs of twins. For our purposes, let’s assume the number of features (\(p\)=2) is too large (too “high-dimensional”) and that we want to reduce the dimension to \(p\)=1.

First, let’s simulate the heights of pairs of twins.
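A minimal sketch of such a simulation is below; the mean (69), SD (3), and twin correlation (0.9) are assumptions for illustration, not the exact values used to produce the output that follows.

```r
set.seed(1988)
n <- 100
rho <- 0.9
z1 <- rnorm(n); z2 <- rnorm(n)
# build a bivariate normal sample by hand: the shared component z1
# induces the correlation between the two twins' heights
X <- cbind(69 + 3 * z1,                                 # heights of twin 1
           69 + 3 * (rho * z1 + sqrt(1 - rho^2) * z2))  # correlated heights of twin 2
dim(X)    # 100 x 2: one row per pair of twins
head(X)
```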

##          [,1]     [,2]
## [1,] 67.72362 67.32411
## [2,] 68.56875 70.20449
## [3,] 69.04952 68.48654
## [4,] 71.10088 72.11233
## [5,] 70.21862 68.46903
## [6,] 70.17676 69.69639

Here we will motivate methods for dimensionality reduction by showing a transformation that permits us to approximate the distance between two-dimensional points with just one dimension.

Consider the first two observations (i.e. first rows) in our dataset X (red points).

The distance between those two points is

## [1] 3.001808

Or we can calculate the Euclidean distance between all observations pairwise using the dist() function.

##          2        3        4        5
## 1 3.001808 1.763315 5.859435 2.745157
## 2 0.000000 1.783949 3.170413 2.394554
## 3 1.783949 0.000000 4.165861 1.169231
## 4 3.170413 4.165861 0.000000 3.748604
## 5 2.394554 1.169231 3.748604 0.000000
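The distance calculations above can be sketched as follows (using a stand-in matrix X, since the original simulation code is not shown):

```r
set.seed(1988)
X <- matrix(rnorm(200, mean = 69, sd = 3), ncol = 2)  # stand-in for the twin heights
# Euclidean distance between the first two observations, from the definition
d12 <- sqrt(sum((X[1, ] - X[2, ])^2))
# all pairwise distances at once; dist() returns a lower-triangular object
d <- dist(X)
all.equal(d12, as.matrix(d)[1, 2])  # TRUE
```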

Note that if we center the data by removing the average from both columns, the distances between the pairs of twins do not change.

Here we use the scale() function, which has center and scale arguments, to remove the column mean and divide by the standard deviation for each column. You can also use the sweep() function for just centering or just scaling.

##          2        3        4        5
## 1 3.001808 1.763315 5.859435 2.745157
## 2 0.000000 1.783949 3.170413 2.394554
## 3 1.783949 0.000000 4.165861 1.169231
## 4 3.170413 4.165861 0.000000 3.748604
## 5 2.394554 1.169231 3.748604 0.000000
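A sketch of this check, again with stand-in data:

```r
set.seed(1988)
X <- matrix(rnorm(200, mean = 69, sd = 3), ncol = 2)  # stand-in heights
# center = TRUE subtracts each column's mean; scale = FALSE keeps the units
X_centered <- scale(X, center = TRUE, scale = FALSE)
# shifting every entry of a column by a constant leaves all pairwise
# differences, and hence all distances, unchanged
all.equal(as.matrix(dist(X)), as.matrix(dist(X_centered)))  # TRUE
```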

OK, let’s start with the naive approach of simply removing one of the two dimensions. Let’s compare the actual distances to the distances computed with just one of the dimensions. The plot below shows the comparison using the first dimension (left) and the second (right).

We see the actual distance is generally underestimated when using only 1 of the 2 columns of heights. This is to be expected, since the actual distance sums over more terms. If we instead average over the dimensions and use this distance,

\[d_{ij} = \sqrt{ \frac{1}{2} \sum_{p=1}^2 (X_{i,p}-X_{j,p})^2 }\] then we see the bias goes away
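The comparison can be sketched like this, assuming a stand-in simulation of correlated twin heights (the parameters are illustrative, not the originals):

```r
set.seed(1988)
n <- 100; rho <- 0.9
z1 <- rnorm(n); z2 <- rnorm(n)
X <- cbind(69 + 3 * z1, 69 + 3 * (rho * z1 + sqrt(1 - rho^2) * z2))
d_full <- as.matrix(dist(X))        # distance using both dimensions
d_one  <- as.matrix(dist(X[, 1]))   # distance using the first dimension only
d_avg  <- d_full / sqrt(2)          # the averaged distance defined above
# d_one systematically underestimates d_full; d_avg is roughly unbiased
mean(d_one - d_full)
mean(d_one - d_avg)
```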

We also see there is a strong correlation. Can we pick a one dimensional summary that makes this correlation even stronger?

## [1] 0.97286

If we look back at the plot and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to run along the direction of the diagonal. Notice what happens if we instead plot the average and the difference of the two heights:

We see that almost all the variation is explained by the Mean axis. This means that we can essentially ignore the second dimension and not lose too much information. If the line were completely flat, we would lose no information. If we use this transformation of the data instead, we get a much higher correlation:

## [1] 0.9973466
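A sketch of the transformation (on stand-in correlated data; correlation values will differ slightly from the output above):

```r
set.seed(1988)
n <- 100; rho <- 0.9
z1 <- rnorm(n); z2 <- rnorm(n)
X <- cbind(69 + 3 * z1, 69 + 3 * (rho * z1 + sqrt(1 - rho^2) * z2))
# rotate each observation into (average, difference) coordinates
Z <- cbind((X[, 1] + X[, 2]) / 2,   # "Mean" axis: carries most of the variation
            X[, 1] - X[, 2])        # "Difference" axis: little variation left
d_full <- as.numeric(dist(X))
cor(d_full, as.numeric(dist(Z[, 1])))   # one-dimensional summary on the Mean axis
```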

Note that each row of \(X\) was transformed using a linear transformation.

If you are familiar with linear algebra we can write the operation we just performed like this:

\[ Z = X A \mbox{ with } A = \, \begin{pmatrix} 1/2&1\\ 1/2&-1\\ \end{pmatrix} \]

We can transform back by simply multiplying by \(A^{-1}\) as follows:

\[ X = Z A^{-1} \mbox{ with } A^{-1} = \, \begin{pmatrix} 1&1\\ 1/2&-1/2\\ \end{pmatrix} \]
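In R, this can be verified directly (a sketch; X here is any two-column matrix):

```r
A <- matrix(c(1/2, 1/2,    # first column: coefficients for the average
              1,  -1),     # second column: coefficients for the difference
            nrow = 2)
Ainv <- solve(A)
Ainv
#      [,1] [,2]
# [1,]  1.0  1.0
# [2,]  0.5 -0.5
set.seed(1988)
X <- matrix(rnorm(20), ncol = 2)  # any two-column data
Z <- X %*% A
all.equal(X, Z %*% Ainv)          # TRUE: the transformation is invertible
```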

Note: we can actually guarantee that the distances remain exactly the same if we re-scale the columns of \(A\) so that each has sum of squares equal to 1:

\[a_{1,1}^2 + a_{2,1}^2 = 1\mbox{ and } a_{1,2}^2 + a_{2,2}^2=1\]

and the correlation of the columns is 0. In this particular example, to achieve this we multiply the first set of coefficients (first column of \(A\)) by \(\sqrt{2}\) and the second by \(1/\sqrt{2}\). Then we get the exact same distances if we use both dimensions, and a very good approximation if we use only the first.

In this case \(Z\) is called an orthogonal rotation of \(X\): it preserves the distances between points.
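A sketch of the rescaled, orthogonal version of the rotation:

```r
A <- cbind(c(1, 1) / sqrt(2),     # rescaled "average" direction
           c(1, -1) / sqrt(2))    # rescaled "difference" direction
crossprod(A)                      # t(A) %*% A: the identity, so A is orthogonal
set.seed(1988)
X <- matrix(rnorm(20), ncol = 2)  # any two-column data
Z <- X %*% A
all.equal(as.numeric(dist(X)), as.numeric(dist(Z)))  # TRUE: distances preserved
```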

Let’s formalize this a bit more.

Singular Value Decomposition (SVD)

The singular value decomposition (SVD) is a generalization of the algorithm we used in the motivational section. As in the example, the SVD provides a transformation of the original data. This transformation has some very useful properties.

The main result the SVD provides is that we can write any matrix \(\mathbf{X}\) (of dimension \(m \times n\)) as

\[\mathbf{U}^\top\mathbf{X} = \mathbf{DV}^\top\]

With:

  • \(\mathbf{U}\) is an \(m \times p\) orthogonal matrix (left singular vectors)
  • \(\mathbf{V}\) is an \(n \times p\) orthogonal matrix (right singular vectors)
  • \(\mathbf{D}\) is a \(p \times p\) diagonal matrix (singular values)

with \(p=\mbox{min}(m,n)\). \(\mathbf{U}^\top\) provides a rotation of our data \(\mathbf{X}\) that turns out to be very useful because the variability of the columns of \(\mathbf{U}^\top \mathbf{X}\) (or \(\mathbf{DV}^\top\)) is decreasing. Because \(\mathbf{U}\) is orthogonal, we can write the SVD like this:

\[\mathbf{X} = \mathbf{UDV}^\top\]

Now we will apply the SVD to the motivating example using the svd() function in R:

## [1]   2 100

Note that we transposed X to put the (p=2) features along the rows, with the n=100 observations along the columns, before applying svd().

The svd() function returns the three matrices; for \(\mathbf{D}\), only the diagonal entries are returned.

## List of 3
##  $ d: num [1:2] 13.9 2.19
##  $ u: num [1:2, 1:2] -0.707 -0.707 -0.707 0.707
##  $ v: num [1:100, 1:2] 0.0499 -0.01324 0.00826 -0.08742 -0.01062 ...

To obtain the principal components from the SVD, we simply apply the rotation \(\mathbf{U}^\top\mathbf{X}\); with the features along the rows, the principal components are the rows of the result:

## [1]   2 100
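The steps above can be sketched as follows (with stand-in simulated heights, since the original chunk is not shown):

```r
set.seed(1988)
n <- 100; rho <- 0.9
z1 <- rnorm(n); z2 <- rnorm(n)
x <- cbind(69 + 3 * z1, 69 + 3 * (rho * z1 + sqrt(1 - rho^2) * z2))
y <- t(x)                      # features (p = 2) along rows
y <- y - rowMeans(y)           # center each feature
s <- svd(y)
pcs <- t(s$u) %*% y            # rotate: the rows are the principal components
dim(pcs)                       # 2 x 100
all.equal(pcs, diag(s$d) %*% t(s$v))  # TRUE: U^T Y equals D V^T
```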

We can plot the \(n=100\) twin heights along the principal components

Alternatively, we could have used \(\mathbf{DV}^\top\):

Another way of calculating the PCs is to use the prcomp() function.

## List of 5
##  $ sdev    : num [1:2] 1.4 0.22
##  $ rotation: num [1:2, 1:2] 0.707 0.707 0.707 -0.707
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "PC1" "PC2"
##  $ center  : num [1:2] -8.40e-18 1.97e-17
##  $ scale   : num [1:2] 3.1 2.97
##  $ x       : num [1:100, 1:2] -0.694 0.184 -0.115 1.215 0.148 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "PC1" "PC2"
##  - attr(*, "class")= chr "prcomp"

The output pc$x contains the principal components

The PCs are the same as the right singular vectors when the data are centered and scaled (the signs of the PCs are arbitrary).
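The equivalence can be sketched like this; for simplicity this sketch only centers the data (the output above used scale. = TRUE as well):

```r
set.seed(1988)
x <- matrix(rnorm(200, mean = 69, sd = 3), ncol = 2)  # stand-in data
pc <- prcomp(x)                                    # centers the columns by default
s  <- svd(scale(x, center = TRUE, scale = FALSE))  # SVD of the centered data
# pc$x (the PCs) equals U D, up to arbitrary column signs
all.equal(abs(unname(pc$x)), abs(s$u %*% diag(s$d)))  # TRUE
```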

Cool! So we have now reduced the \(p=2\) dimensions to 1 dimension that explains most of the variation.

OK, but how is this useful?

I will admit in our super simple example, it is not immediately obvious how incredibly useful the SVD can be. So let’s consider an example with more than two features. In this example, we will greatly reduce the dimension of \(\mathbf{V}\) and still be able to reconstruct \(\mathbf{X}\).

Motivating example with the Written Digits

When you write and mail a letter, the first thing that happens when it is received at the post office is that it is sorted by zip code:

Originally, humans had to sort these, but today, thanks to machine learning algorithms, a computer can read the zip codes. The data are available in the data repository on the course GitHub page.

## [1] 5000  784

Here are three images of written digits.

What are the features?

Each image is converted into \(28 \times 28\) pixels, and for each pixel we obtain a grey scale intensity between 0 (white) and 255 (black). This means one image has 784 (=28*28) features.

We can see these values like this:

What are the outcomes?

So for each digit \(i\) we have an outcome \(Y_i\) which can be one of 10 categories: \(0,1,2,3,4,5,6,7,8,9\) and the features \(X_{i,1}, \dots, X_{i,784}\) which can take values from 0 to 255. We use bold face to denote this vector of predictors \(\mathbf{X}_i = (X_{i,1}, \dots, X_{i,784})\).

784 features is a lot of features. It is reasonable to assume that pixels close together are somewhat correlated. Is there any room for data reduction? Can we identify a reduced set of features that explains the variation in the data?

Calculating the top PCs

If you recall, the first PC will explain the most variation in the data, the second PC the second most, and so on.

Because the pixels are so small, we expect those close to each other on the grid to be correlated, meaning that dimension reduction should be possible.

Let’s take the SVD of \(\mathbf{X}\). We only read in 5000 observations, so this shouldn’t take very long, but if you do this on the entire dataset, this will take a little while.

## [1] 5000  784

Remember, we need to column-center the data. We will also create a new variable \(\mathbf{Y}\) to represent the centered data, transposed so the features are along the rows.

## [1]  784 5000

Now apply the svd() function to \(\mathbf{Y}\).

## List of 3
##  $ d: num [1:784] 40536 35266 32786 29761 29230 ...
##  $ u: num [1:784, 1:784] -2.78e-17 -1.11e-16 0.00 0.00 -1.11e-16 ...
##  $ v: num [1:5000, 1:784] -0.01161 0.00853 -0.02516 -0.01806 -0.01299 ...

First note that we can in fact reconstruct \(\mathbf{Y}\) using all the PCs:

## [1] 3.928804e-10
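The reconstruction can be sketched on a small stand-in matrix (the same algebra applies to the digit data):

```r
set.seed(1988)
y <- matrix(rnorm(50 * 20), nrow = 50)    # stand-in for the transposed digit data
s <- svd(y)
y_hat <- s$u %*% diag(s$d) %*% t(s$v)     # reconstruct using all the components
max(abs(y - y_hat))                       # numerically zero
```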

If we look at the singular values in \(\mathbf{D}\), we see that the last few are quite close to 0.

This implies that the last columns of \(\mathbf{V}\) have a very small effect on the reconstruction of \(\mathbf{X}\). To see this, consider the extreme case in which a trailing entry of \(\mathbf{D}\) is 0: the corresponding column of \(\mathbf{V}\) is then not needed at all.

Because of the way the SVD is constructed, the columns of \(\mathbf{V}\) have less and less influence on the reconstruction of \(\mathbf{X}\). You commonly see this described as “explaining less variance”. This implies that for a large matrix, by the time you get to the last columns, it is possible that there is not much left to “explain”.

As an example, we will look at what happens if we remove the last 100 columns:

## [1] 3.928804e-10

The largest residual is practically 0, meaning that Yhat is practically the same as Y, yet we need 100 fewer dimensions to transmit the information.
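The truncated reconstruction can be sketched as below. Here we use a stand-in matrix with strong low-rank structure plus small noise (an assumption for illustration; the digit data behaves similarly because nearby pixels are correlated):

```r
set.seed(1988)
u0 <- matrix(rnorm(50 * 3), nrow = 50)
v0 <- matrix(rnorm(20 * 3), nrow = 20)
y  <- u0 %*% t(v0) + matrix(rnorm(50 * 20, sd = 0.01), nrow = 50)
s  <- svd(y)
k  <- ncol(y) - 10                        # drop the last 10 columns
y_hat <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
max(abs(y - y_hat))                       # still tiny: little information lost
```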

By looking at \(\mathbf{D}\), we can see that, in this particular dataset, we can obtain a good approximation keeping only a subset of columns. The following plots are useful for seeing how much of the variability is explained by each column:

We can also make a cumulative plot:

Although we start with 784 dimensions, we can approximate \(X\) with just a few:

Therefore, by keeping only 100 of the 784 dimensions, we retain most of the variability in our data:

## [1] 0.9172635

We say that we explain 92 percent of the variability in our data with 100 PCs.

Note that we can compute this proportion from \(\mathbf{D}\):

## [1] 0.9172635
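The computation can be sketched like this (on stand-in data; the proportion will differ from the digit example above):

```r
set.seed(1988)
y <- matrix(rnorm(50 * 20), nrow = 50)      # stand-in data
y <- sweep(y, 2, colMeans(y))               # column-center
s <- svd(y)
k <- 5
# proportion of total variability explained by the first k PCs
var_explained <- sum(s$d[1:k]^2) / sum(s$d^2)
# matches the proportion computed from prcomp()'s standard deviations
pc <- prcomp(y)
all.equal(var_explained, sum(pc$sdev[1:k]^2) / sum(pc$sdev^2))  # TRUE
```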

The entries of \(\mathbf{D}\) therefore tell us how much each PC contributes in terms of variability explained.

Another way of calculating the PCs is to use the prcomp() function.

The proportion of variance of the first ten PCs is quite high (almost 50%):

##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     573.32405 498.77957 463.70575 420.92107 413.40976
## Proportion of Variance   0.09574   0.07246   0.06263   0.05160   0.04978
## Cumulative Proportion    0.09574   0.16819   0.23082   0.28242   0.33220
##                              PC6       PC7       PC8       PC9      PC10
## Standard deviation     384.74711 336.96599 313.63881 308.84477 289.24143
## Proportion of Variance   0.04311   0.03307   0.02865   0.02778   0.02437
## Cumulative Proportion    0.37531   0.40839   0.43704   0.46482   0.48918

We can also plot the standard deviations:

or, more commonly, the variance explained:

We can also see that the first two PCs will in fact be quite informative. Here is a plot of the first two PCs:

We can also “see” the linear combinations on the grid to get an idea of what is getting weighted:

Other methods of dimensionality reduction

Linear methods

  1. Factor analysis.

This technique is best suited for situations where we have highly correlated sets of variables. It divides the variables into groups based on their correlation and represents each group with a factor, aiming to identify latent sources of variation. It is also typically tied to Gaussian distributions.

  2. Independent Component Analysis (ICA)

Similar to factor analysis, but it relies on non-Gaussian assumptions. We can use ICA to transform the data into independent components that describe the data using a smaller number of components.

Non-linear methods

  1. Isometric feature mapping (ISOMAP)

We use this technique when the data is strongly non-linear.

  2. t-SNE

This technique also works well when the data are strongly non-linear. It works extremely well for visualization, but the distances between points are meaningless.

  3. UMAP

This technique works well for high-dimensional data, and its run time is shorter than that of t-SNE.

Resources and sources of material